Learning Objectives

  • Further your understanding of R and the R Studio interface
  • Learn the basics in data manipulation using the dplyr package
  • Learn to use help() to look up function documentation
  • Run code from a script

Introduction to Data Management

Now that you’ve learned the basics of R programming, we’ll take things a step further and start working on your skills related to data analysis. You will likely be unfamiliar with some of the operations you need to execute in this exercise. Part of the goal with this exercise, however, is for you to become more familiar with the help commands in R and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available so that you can become an effective problem solver, working independently on data analyses.

Running Code in a Script

So far, we’ve run code in the Console. This is fine for quick queries. For anything to be shareable, reproducible, and to reduce our own efforts in the long-term, we should run code via a script. An R script is a just a simple text file. R-Studio uses the script by copying R commands from text in the file and pastes them into the Console as if you were manually entering the commands yourself. This greatly enhances our ability to build off what we have created in the past, learn from previous experience, and quickly re-run analyses when new data are received. To create an R script:

  1. From the “file” mean, select “New File”
  2. Click “R Script from the list of options

R-Studio will open your R script automatically after creating it. Notice that the scripting window appears above the Console in what’s known as the Source pane.

  1. Save the R script in your working directly in the Scripts folder. Name this file IntroDataMgmt.R.

Figure 1: Create a script for running analyses
At the top of your script, provide brief information that describes the content of your script. The content is up to you, but should briefly identify who created the file, when the file was created, and what the script does. This will help when you return to the file at a later date or if you decide to share the file with a colleague in the future. Remember that anything after the # symbol is a comment. Use this symbol to make your code more readable, similar to below.

# ******************************************************************
# ******************************************************************

# Project: Introduction to Data Analysis in R
# Description: A script which details some basic commands on how to manipulate data
# Author: <Your Name>
# Date Initialized: <dd month yyyy>

# ******************************************************************
# ******************************************************************

Running an R Script

Running code via an R script is different than running code in the Console. To interpret and run the code you’ve written, R needs you to send the code from the script to the Console. Some common ways to run your code include:

  1. Highlight the line of code you want to run and use the shortcut Ctrl + Enter (Windows) or command + return (Mac).
  2. Highlight your code and physically click the Run button in the top right of the Source pane with your mouse.

Clearing your Workspace

You’ll find that your Environment (Workspace) in the upper right panel will quickly become full with user-defined objects. It’s generally good practice to work with a clean Workspace when starting a session. I generally start all my scripts with the following command to make sure you are starting fresh, something we will do to help develop good programming practices and reduce clutter.

# Clean your workspace/remove all objects
rm(list=ls())

# You can also remove a specific dataset using the following command
#rm(dataset)

Data Table Manipulation with Dplyr

The most basic R skill is to query and manipulate data tables. As a beginner programmer, it is imperative to familiarize yourself with how to manipulate data. Reinforcing these skills is like expanding your vocabulary in the new language that you are learning and is a great way to improve your R proficiency. If you wish to become really good at R, but don’t know where to start, start with data table manipulation!

The base R functions that come with the default R installation have the capacity for almost all the table manipulation needs (e.g., split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()). However, sometimes their syntax are less user-friendly and intuitive than some of the special packages built for table manipulation purposes. So, here we are introducing a few of the most useful table manipulation functions within dplyr package.

Note that you will have to use install.packages() and library() function to download and activate the dplyr before using it. You only need to install the package once on your computer. You will need, however, to ‘activate’ the package any time you want to use the functions that exist within the package.

#install.packages("dplyr")
library(dplyr)

Reading/Importing Data

R has multiple functions for reading in table data. Here we’ll use the base function read.csv() to import a table named panda_data.csv that is located in your Data folder. Text files (.txt) can be imported using the function read.delim(). See the help files for each function and search Google for information on other functions to read other data types.

View the first few rows of the data table using the function head() or click on the dataframe in the Environment/History panel.

# Read dataset
panda_data <- read.csv(file="Data/panda_data.csv")

# Look at the data
head(panda_data)
##   panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1      da_da 4415463   5       100   m   CD             99             50
## 2    mao_mao 4415522   4       120   f   CD             70             30
## 3    lan_lan 4416073   7        95   f   WL             80             NA
## 4    bei_bei 4416405   5       120   f   WL             80             NA
## 5    bao_bao 4417779   5       110   m   WL             75             60
## 6  tian_tian 4424490   4        89   m   WL             95             70
##   genetic_value3 year
## 1             60 2017
## 2             90 2017
## 3             80 2017
## 4             65 2017
## 5             76 2017
## 6             87 2017

Questions:

  1. Can you change the number of rows that are displayed with the head() function?
  2. How would you print the first 3 rows?
  3. How would you print rows 4 thru 6?
  4. What are the dimensions of the dataframe? How many rows and columns are there?
  5. Can you guess how you might look at the last few rows of the dataframe?
# Print the first 3 rows
head(panda_data, n = 3)
##   panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1      da_da 4415463   5       100   m   CD             99             50
## 2    mao_mao 4415522   4       120   f   CD             70             30
## 3    lan_lan 4416073   7        95   f   WL             80             NA
##   genetic_value3 year
## 1             60 2017
## 2             90 2017
## 3             80 2017
# Print rows 4 thru 6
panda_data[4:6,]
##   panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 4    bei_bei 4416405   5       120   f   WL             80             NA
## 5    bao_bao 4417779   5       110   m   WL             75             60
## 6  tian_tian 4424490   4        89   m   WL             95             70
##   genetic_value3 year
## 4             65 2017
## 5             76 2017
## 6             87 2017
# Dataframe dimensions
dim(panda_data)
## [1] 10 10
nrow(panda_data)
## [1] 10
ncol(panda_data)
## [1] 10
# Print the tail
tail(panda_data)
##       panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 5        bao_bao 4417779   5       110   m   WL             75             60
## 6      tian_tian 4424490   4        89   m   WL             95             70
## 7        wei_wei 4424967   7        98   f   CD             60             NA
## 8  shuang_shuang 4424657   8       110   m   WL             92             NA
## 9         qiuqiu 4423758   9       130   f   CD             81             50
## 10     lang_lang 4427758  10       110   m   WL             60             NA
##    genetic_value3 year
## 5              76 2017
## 6              87 2017
## 7              92 2017
## 8              50 2017
## 9              30 2018
## 10             70 2018

Selecting Columns

The function select() is a powerful tool for selecting columns of interest. You must specify the dataset you want to query and then also provide an expression for selecting columns of interest (select(.data, expression)). A few examples are provided below:

# select column called panda_name
select(panda_data, panda_name) 
##       panda_name
## 1          da_da
## 2        mao_mao
## 3        lan_lan
## 4        bei_bei
## 5        bao_bao
## 6      tian_tian
## 7        wei_wei
## 8  shuang_shuang
## 9         qiuqiu
## 10     lang_lang
# select all columns in the data except panda_name
select(panda_data, -panda_name)
##         ID age weight_kg sex base genetic_value1 genetic_value2 genetic_value3
## 1  4415463   5       100   m   CD             99             50             60
## 2  4415522   4       120   f   CD             70             30             90
## 3  4416073   7        95   f   WL             80             NA             80
## 4  4416405   5       120   f   WL             80             NA             65
## 5  4417779   5       110   m   WL             75             60             76
## 6  4424490   4        89   m   WL             95             70             87
## 7  4424967   7        98   f   CD             60             NA             92
## 8  4424657   8       110   m   WL             92             NA             50
## 9  4423758   9       130   f   CD             81             50             30
## 10 4427758  10       110   m   WL             60             NA             70
##    year
## 1  2017
## 2  2017
## 3  2017
## 4  2017
## 5  2017
## 6  2017
## 7  2017
## 8  2017
## 9  2018
## 10 2018
# select a range of columns, from age to sex
select(panda_data, age:sex)
##    age weight_kg sex
## 1    5       100   m
## 2    4       120   f
## 3    7        95   f
## 4    5       120   f
## 5    5       110   m
## 6    4        89   m
## 7    7        98   f
## 8    8       110   m
## 9    9       130   f
## 10  10       110   m

Various selection helpers also exist, including:

  • starts_with: Expression select multiple columns that start with the same text.
  • ends_with(): Expression to select columns that end with the same text.
  • contains(): Expression to select columns that contain the same text.
  • matches(): Expression to select columns that match a regular expression.
  • one_of(): Expression to select columns that are from a group of names.
# select all columns that start with "genetic" in their column names
select(panda_data, starts_with("genetic")) 
##    genetic_value1 genetic_value2 genetic_value3
## 1              99             50             60
## 2              70             30             90
## 3              80             NA             80
## 4              80             NA             65
## 5              75             60             76
## 6              95             70             87
## 7              60             NA             92
## 8              92             NA             50
## 9              81             50             30
## 10             60             NA             70

Questions:

  1. Select all the columns that contain “value” in the column name.
  2. Select or exclude two columns: panda_name and age.
# Select all the columns that contain "value" in the column name.
select(panda_data, contains("value")) 
##    genetic_value1 genetic_value2 genetic_value3
## 1              99             50             60
## 2              70             30             90
## 3              80             NA             80
## 4              80             NA             65
## 5              75             60             76
## 6              95             70             87
## 7              60             NA             92
## 8              92             NA             50
## 9              81             50             30
## 10             60             NA             70
# Select or exclude two columns: `panda_name` and `age`. 
select(panda_data, c(panda_name, age)) 
##       panda_name age
## 1          da_da   5
## 2        mao_mao   4
## 3        lan_lan   7
## 4        bei_bei   5
## 5        bao_bao   5
## 6      tian_tian   4
## 7        wei_wei   7
## 8  shuang_shuang   8
## 9         qiuqiu   9
## 10     lang_lang  10
select(panda_data, !c(panda_name, age))
##         ID weight_kg sex base genetic_value1 genetic_value2 genetic_value3 year
## 1  4415463       100   m   CD             99             50             60 2017
## 2  4415522       120   f   CD             70             30             90 2017
## 3  4416073        95   f   WL             80             NA             80 2017
## 4  4416405       120   f   WL             80             NA             65 2017
## 5  4417779       110   m   WL             75             60             76 2017
## 6  4424490        89   m   WL             95             70             87 2017
## 7  4424967        98   f   CD             60             NA             92 2017
## 8  4424657       110   m   WL             92             NA             50 2017
## 9  4423758       130   f   CD             81             50             30 2018
## 10 4427758       110   m   WL             60             NA             70 2018

Filtering Data

Filter() is similar to select(), except that you are selecting specific rows that satisfy a requirement based on a column value. This function is very similar to the base function subset(). A few examples on how to use filter() are provided below:

Artwork by Allison Horst
Artwork by Allison Horst


# Select rows where pandas are greater than or equal to 5 years of age
filter(panda_data, age >= 5)
##      panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1         da_da 4415463   5       100   m   CD             99             50
## 2       lan_lan 4416073   7        95   f   WL             80             NA
## 3       bei_bei 4416405   5       120   f   WL             80             NA
## 4       bao_bao 4417779   5       110   m   WL             75             60
## 5       wei_wei 4424967   7        98   f   CD             60             NA
## 6 shuang_shuang 4424657   8       110   m   WL             92             NA
## 7        qiuqiu 4423758   9       130   f   CD             81             50
## 8     lang_lang 4427758  10       110   m   WL             60             NA
##   genetic_value3 year
## 1             60 2017
## 2             80 2017
## 3             65 2017
## 4             76 2017
## 5             92 2017
## 6             50 2017
## 7             30 2018
## 8             70 2018
# select rows that have age>5 OR weight_kg >100 
filter(panda_data, age > 5 | weight_kg > 100)
##      panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1       mao_mao 4415522   4       120   f   CD             70             30
## 2       lan_lan 4416073   7        95   f   WL             80             NA
## 3       bei_bei 4416405   5       120   f   WL             80             NA
## 4       bao_bao 4417779   5       110   m   WL             75             60
## 5       wei_wei 4424967   7        98   f   CD             60             NA
## 6 shuang_shuang 4424657   8       110   m   WL             92             NA
## 7        qiuqiu 4423758   9       130   f   CD             81             50
## 8     lang_lang 4427758  10       110   m   WL             60             NA
##   genetic_value3 year
## 1             90 2017
## 2             80 2017
## 3             65 2017
## 4             76 2017
## 5             92 2017
## 6             50 2017
## 7             30 2018
## 8             70 2018
# select rows that have age>5 AND base column has CD has entry
filter(panda_data, age > 5 & base == "CD") 
##   panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1    wei_wei 4424967   7        98   f   CD             60             NA
## 2     qiuqiu 4423758   9       130   f   CD             81             50
##   genetic_value3 year
## 1             92 2017
## 2             30 2018
# Select rows where the panda age is defined by a few values.  Note the use of the %in% function.
filter(panda_data, age %in% c(4,5,7))
##   panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1      da_da 4415463   5       100   m   CD             99             50
## 2    mao_mao 4415522   4       120   f   CD             70             30
## 3    lan_lan 4416073   7        95   f   WL             80             NA
## 4    bei_bei 4416405   5       120   f   WL             80             NA
## 5    bao_bao 4417779   5       110   m   WL             75             60
## 6  tian_tian 4424490   4        89   m   WL             95             70
## 7    wei_wei 4424967   7        98   f   CD             60             NA
##   genetic_value3 year
## 1             60 2017
## 2             90 2017
## 3             80 2017
## 4             65 2017
## 5             76 2017
## 6             87 2017
## 7             92 2017

Questions:

1: Select rows with NA in the genetic_value2 column.
2: Select rows whose panda_name column are bao_bao or bei_bei.

# Select rows with `NA` in the genetic_value2 column. 
filter(panda_data, is.na(genetic_value2)) 
##      panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1       lan_lan 4416073   7        95   f   WL             80             NA
## 2       bei_bei 4416405   5       120   f   WL             80             NA
## 3       wei_wei 4424967   7        98   f   CD             60             NA
## 4 shuang_shuang 4424657   8       110   m   WL             92             NA
## 5     lang_lang 4427758  10       110   m   WL             60             NA
##   genetic_value3 year
## 1             80 2017
## 2             65 2017
## 3             92 2017
## 4             50 2017
## 5             70 2018
# Select rows whose panda_name column are `bao_bao` or `bei_bei`
filter(panda_data, panda_name == 'bao_bao' | panda_name == 'bei_bei') 
##   panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1    bei_bei 4416405   5       120   f   WL             80             NA
## 2    bao_bao 4417779   5       110   m   WL             75             60
##   genetic_value3 year
## 1             65 2017
## 2             76 2017

Using the Pipe Operator

Piping (%>%) allows the user to combine the output from one function to the input of another. Thus, instead of nesting functions (reading from the inside to the outside), piping reads functions from left to right (i.e., the way we normally read things). As a result, reading piped code can be more intuitive and can help avoid creating and saving a lot of intermediate variables that you don’t need.

# Use a simple pipe to select the panda name and it's sex, and output the result.
# Create a new object named 'pipe_result'
pipe_result <- panda_data %>%
  select(panda_name, sex) %>%
  head()

# Output the result to the screen 
pipe_result
##   panda_name sex
## 1      da_da   m
## 2    mao_mao   f
## 3    lan_lan   f
## 4    bei_bei   f
## 5    bao_bao   m
## 6  tian_tian   m

Questions:

  1. Use a pipe operator to select all columns containing “genetic” in their names, but only select rows that have a genetic_value1 >80 AND genetic_value2 <90. Don’t create a new object when you create the query (i.e., just print the result to your screen).
  2. What do you think will happen if you filter based on a field that hasn’t been selected? For instance, select all columns containing “genetic”, but select the rows where animals weigh > 90 kg.
# Select columns that contain genetic and then selection values based on these columns
panda_data %>% 
  select(contains("genetic")) %>% 
  filter(genetic_value1 > 80 & genetic_value2 < 90)
##   genetic_value1 genetic_value2 genetic_value3
## 1             99             50             60
## 2             95             70             87
## 3             81             50             30
# Example of filtering a column that doesn't exist after selecting
# The field doesn't exist and will cause an error
# panda_data %>% 
#   select(contains("genetic")) %>% 
#   filter(weight_kg > 90)

Mutate

One of the most useful functions in dplyr package is mutate(). Mutate allows the user to create new column(s), populating the columns with values that you define or from information in columns that already exist. You can also use mutate() to control which columns are retained in the new object that you create by setting the .keep argument.

Artwork by Allison Horst
Artwork by Allison Horst


# Create a new column, based on values from other columns that exist
# By default, keep = all
new_col_ex1 <- panda_data %>%
  mutate(genetic_value_new = genetic_value1 - genetic_value2,
         .keep = "all")  
# Print
new_col_ex1
##       panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1          da_da 4415463   5       100   m   CD             99             50
## 2        mao_mao 4415522   4       120   f   CD             70             30
## 3        lan_lan 4416073   7        95   f   WL             80             NA
## 4        bei_bei 4416405   5       120   f   WL             80             NA
## 5        bao_bao 4417779   5       110   m   WL             75             60
## 6      tian_tian 4424490   4        89   m   WL             95             70
## 7        wei_wei 4424967   7        98   f   CD             60             NA
## 8  shuang_shuang 4424657   8       110   m   WL             92             NA
## 9         qiuqiu 4423758   9       130   f   CD             81             50
## 10     lang_lang 4427758  10       110   m   WL             60             NA
##    genetic_value3 year genetic_value_new
## 1              60 2017                49
## 2              90 2017                40
## 3              80 2017                NA
## 4              65 2017                NA
## 5              76 2017                15
## 6              87 2017                25
## 7              92 2017                NA
## 8              50 2017                NA
## 9              30 2018                31
## 10             70 2018                NA
# You can create multiple columns at once.  Best to put each new column on a separate line.  This simply makes the code more readable.
new_col_ex2 <- panda_data %>%
    mutate(genetic_dif = genetic_value1 - genetic_value2, 
           weight_g = weight_kg * 1000) 
# Print
new_col_ex2
##       panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1          da_da 4415463   5       100   m   CD             99             50
## 2        mao_mao 4415522   4       120   f   CD             70             30
## 3        lan_lan 4416073   7        95   f   WL             80             NA
## 4        bei_bei 4416405   5       120   f   WL             80             NA
## 5        bao_bao 4417779   5       110   m   WL             75             60
## 6      tian_tian 4424490   4        89   m   WL             95             70
## 7        wei_wei 4424967   7        98   f   CD             60             NA
## 8  shuang_shuang 4424657   8       110   m   WL             92             NA
## 9         qiuqiu 4423758   9       130   f   CD             81             50
## 10     lang_lang 4427758  10       110   m   WL             60             NA
##    genetic_value3 year genetic_dif weight_g
## 1              60 2017          49   100000
## 2              90 2017          40   120000
## 3              80 2017          NA    95000
## 4              65 2017          NA   120000
## 5              76 2017          15   110000
## 6              87 2017          25    89000
## 7              92 2017          NA    98000
## 8              50 2017          NA   110000
## 9              30 2018          31   130000
## 10             70 2018          NA   110000

Question:

  1. Create a new column in panda_data called zero and give it a value of 0?
# Create a new column called zero
panda_data %>%
  mutate(zero = 0)
##       panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1          da_da 4415463   5       100   m   CD             99             50
## 2        mao_mao 4415522   4       120   f   CD             70             30
## 3        lan_lan 4416073   7        95   f   WL             80             NA
## 4        bei_bei 4416405   5       120   f   WL             80             NA
## 5        bao_bao 4417779   5       110   m   WL             75             60
## 6      tian_tian 4424490   4        89   m   WL             95             70
## 7        wei_wei 4424967   7        98   f   CD             60             NA
## 8  shuang_shuang 4424657   8       110   m   WL             92             NA
## 9         qiuqiu 4423758   9       130   f   CD             81             50
## 10     lang_lang 4427758  10       110   m   WL             60             NA
##    genetic_value3 year zero
## 1              60 2017    0
## 2              90 2017    0
## 3              80 2017    0
## 4              65 2017    0
## 5              76 2017    0
## 6              87 2017    0
## 7              92 2017    0
## 8              50 2017    0
## 9              30 2018    0
## 10             70 2018    0

Arrange, Relocate, and Rename

It is common to receive data in ways that we want to change or alter. For instance, we might want to change the order of the columns, re-order the rows based on particular column values, or we may want to change the column headings. Mutate() includes options to change the order of columns that you create (see .before and .after in the help for the function). Relocate() can be used move columns around using the .before and .after commands, arrange() can be used to order the rows, and rename() can be used to give new names to column headings. By default, arrange will arrange the selecting column in ascending order. We can use desc() to arrange the column in descending order.

Artwork by Allison Horst
Artwork by Allison Horst


# Arrange the panda_data by panda_name in descending order
# Place the sex column after the ID column
order_data_ex1 <- panda_data %>% 
    arrange(desc(panda_name)) %>% 
  relocate(sex, .after = ID)
# Print
head(order_data_ex1)
##      panda_name      ID sex age weight_kg base genetic_value1 genetic_value2
## 1       wei_wei 4424967   f   7        98   CD             60             NA
## 2     tian_tian 4424490   m   4        89   WL             95             70
## 3 shuang_shuang 4424657   m   8       110   WL             92             NA
## 4        qiuqiu 4423758   f   9       130   CD             81             50
## 5       mao_mao 4415522   f   4       120   CD             70             30
## 6     lang_lang 4427758   m  10       110   WL             60             NA
##   genetic_value3 year
## 1             92 2017
## 2             87 2017
## 3             50 2017
## 4             30 2018
## 5             90 2017
## 6             70 2018
# Arrange the panda_data by genetic_value 1 and genetic_value2 after select the columns that start with 'genetic'
# Rename the genetic_value1 column to gen_val1
order_data_ex2 <- panda_data %>% 
    select(starts_with("genetic")) %>%
    arrange(genetic_value1, genetic_value2) %>% 
  rename(gen_val1 = genetic_value1)
# Print
head(order_data_ex2)
##   gen_val1 genetic_value2 genetic_value3
## 1       60             NA             92
## 2       60             NA             70
## 3       70             30             90
## 4       75             60             76
## 5       80             NA             80
## 6       80             NA             65

Question:

  1. Arrange the panda_data by age and then by panda weight, putting the year column before age and the base column after the last column? Rename the weight column to panda_wgt_kg.
# Arrange the panda dataset by age and weight, relocating columns and renaming the weight column
panda_data %>% 
  arrange(age, weight_kg) %>% 
  relocate(year, .before = age) %>% 
  relocate(base, .after = last_col()) %>% 
  rename(panda_wgt_kg = weight_kg)
##       panda_name      ID year age panda_wgt_kg sex genetic_value1
## 1      tian_tian 4424490 2017   4           89   m             95
## 2        mao_mao 4415522 2017   4          120   f             70
## 3          da_da 4415463 2017   5          100   m             99
## 4        bao_bao 4417779 2017   5          110   m             75
## 5        bei_bei 4416405 2017   5          120   f             80
## 6        lan_lan 4416073 2017   7           95   f             80
## 7        wei_wei 4424967 2017   7           98   f             60
## 8  shuang_shuang 4424657 2017   8          110   m             92
## 9         qiuqiu 4423758 2018   9          130   f             81
## 10     lang_lang 4427758 2018  10          110   m             60
##    genetic_value2 genetic_value3 base
## 1              70             87   WL
## 2              30             90   CD
## 3              50             60   CD
## 4              60             76   WL
## 5              NA             65   WL
## 6              NA             80   WL
## 7              NA             92   CD
## 8              NA             50   WL
## 9              50             30   CD
## 10             NA             70   WL

Summarise

Summary statistics are easily calculated using the summarise() function (note: summarize() will also work). The summary statistics will be calculated among all rows in the dataframe, unless specified otherwise. Summarize() is often used in combination with group_by() (see below for more information). Various useful summary functions are provided in the help documentation for the function.

# Calculate the mean weight of the pandas and provide a count.
sum_table <- panda_data %>% 
  summarise(mean_wgt = mean(weight_kg),
            n = n())
# Print summary table
sum_table
##   mean_wgt  n
## 1    108.2 10
# Calculate the avg weight and minimum weight.  Don't create a new object.
panda_data %>% 
  summarise(avg_wt = mean(weight_kg), 
            min_wt = min(weight_kg))
##   avg_wt min_wt
## 1  108.2     89

Grouping Functions

Grouping data together for class summaries has been improved in the latest version of dplyr by adding .by to the summarise() function. As a result, it is no longer necessary to use the function Group_by() in addition and/or external to summarise(). The .by argument allows you to group your columns of interest and generate a series of summary statistics.

# Similar to above, calculate average and minimum weight, but summarize based on 'base' column.  Include a count of each group.
panda_data %>%
  summarise(avg_wt = mean(weight_kg),
            min_wt = min(weight_kg),
            n = n(),
            .by = base)
##   base   avg_wt min_wt n
## 1   CD 112.0000     98 4
## 2   WL 105.6667     89 6
# Group summaries can also be calculated across muultiple groups.
# Here, we calculate the same as above, but based on base and sex
panda_data %>%
  summarise(avg_wt = mean(weight_kg),
            min_wt= min(weight_kg),
            n = n(), 
            .by = c(base, sex))
##   base sex avg_wt min_wt n
## 1   CD   m 100.00    100 1
## 2   CD   f 116.00     98 3
## 3   WL   f 107.50     95 2
## 4   WL   m 104.75     89 4

Joining Tables

The last of the important tools we will cover in dplyr is how to join tables together. This is common practice in ecological statistics, with data in one table that are required to be appended to another table for analyses. To join these tables, we need to link the tables based on shared columns.

As an example, we will import a table (panda_data_med.csv) located in our Data folder that summarizes the vaccination history of each our pandas. Similar to other functions, we have multiple options to join the tables together. The most common join to use is a left_join(). In this case, all the records in our first table (i.e., the left table) are linked with those that match or are shared with the second data table (i.e., the right table). The means that we will keep all the records in the left table and include only those records that match in the right table (they will be returned NA if no match exists).

Other join options, including inner_join(), right_join, and full_join() also exist. Read the help file to determine which join is most appropriate to accomplish the activity of interest.

Note: Sometimes the column headings between tables do not match, even if the data within the column does (ID in table 1 != id in table 2). In these cases, you can either rename() the column headings to make them match or specify columns you want to join.

# Read in the vaccination table
panda_med <- read.csv(file="Data/panda_data_med.csv")

# Join all the rows in table 1 (panda_data) with table 2 (panda_med) to determine which of the pandas were vaccinated.  Since ID exists in both tables, this is a straightforward join.
# Arrange the result by ID and year vaccinated
panda_join_ex1 <- panda_data %>% 
  left_join(panda_med, by = "ID") %>% 
  arrange(ID, year_vaccination)
# Print
head(panda_join_ex1)
##   panda_name.x      ID age weight_kg sex base genetic_value1 genetic_value2
## 1        da_da 4415463   5       100   m   CD             99             50
## 2      mao_mao 4415522   4       120   f   CD             70             30
## 3      mao_mao 4415522   4       120   f   CD             70             30
## 4      lan_lan 4416073   7        95   f   WL             80             NA
## 5      bei_bei 4416405   5       120   f   WL             80             NA
## 6      bao_bao 4417779   5       110   m   WL             75             60
##   genetic_value3 year panda_name.y year_vaccination     vaccine_type
## 1             60 2017        da_da             2018 canine distemper
## 2             90 2017      mao_mao             2017 canine distemper
## 3             90 2017      mao_mao             2018           rabies
## 4             80 2017         <NA>               NA             <NA>
## 5             65 2017         <NA>               NA             <NA>
## 6             76 2017         <NA>               NA             <NA>
# Example of how to join based on multiple fields and fields that don't exactly match
panda_join_ex2 <- panda_data %>%
  left_join(panda_med,
            by = c("ID" = "ID", 
                   "year"= "year_vaccination")) 
# Print
head(panda_join_ex2)
##   panda_name.x      ID age weight_kg sex base genetic_value1 genetic_value2
## 1        da_da 4415463   5       100   m   CD             99             50
## 2      mao_mao 4415522   4       120   f   CD             70             30
## 3      lan_lan 4416073   7        95   f   WL             80             NA
## 4      bei_bei 4416405   5       120   f   WL             80             NA
## 5      bao_bao 4417779   5       110   m   WL             75             60
## 6    tian_tian 4424490   4        89   m   WL             95             70
##   genetic_value3 year panda_name.y     vaccine_type
## 1             60 2017         <NA>             <NA>
## 2             90 2017      mao_mao canine distemper
## 3             80 2017         <NA>             <NA>
## 4             65 2017         <NA>             <NA>
## 5             76 2017         <NA>             <NA>
## 6             87 2017         <NA>             <NA>

Exercise

Now that you’ve been exposed to a few functions to better manage data in R, we will reinforce these tools by asking you to apply them to memory by practicing on a dataset of wildlife counts collected across the Naboisho conservancy. Located in your Data folder and named SampleData.csv, these data are derived from our publication in the African Journal of Ecology that forms the foundation for this project.

Upon import you will note that the dataset is comprised of 5 columns:

  • Date: The date the transect was conducted
  • Species: Species name
  • GroupSize: Total number of animals in the counted group
  • TransectID: Name of each transect
  • Year: Year in which the data were collected

Please create a script named Exercise1_DataMgmt.R. Format the script like we have instructed to do so in this lecture with your Name, Date, and Description. Then, read in the data and create a fully commented code to answer the following questions:

  1. What are the dimensions of the dataset? How many rows and columns exist?
  2. What is the data structure of each variable?
  3. How many species were counted?
  4. How many transects are there and what are the names of these transects?
  5. How many unique survey days were completed? What is the range of dates?

Please use a %>% for the followings questions (you don’t need to save the output):

  1. What is the maximum group size of giraffe?
  2. What is the mean group size of wildebeest per year? Include the number of observations in the calculation.
  3. What is the total count of impala observed in 2018?
  4. What is the total number of groups of each species per year?
  5. What is the total number of groups and the mean group size of each species, per year, and per transect? Sort these by Species, TransectID, and Year.

Note: We didn’t go over every function to answer these questions. You may need to troubleshoot and search the web to find the appropriate function. Remember to use R-Cran when searching to limit/fine tune your search. Using dplyr in your search term can also help in narrowing down your search.

The data used for this exercise can be referenced as:

Crego RD, Wells HBM, Connette G, Stabach JA, Soit N, Thompson S. 2023. Monitoring spatiotemporal dynamics of large herbivores across an African rangeland using hierarchical multi-species distance sampling. African Journal of Ecology. doi:10.1111/aje.13219

# Read in the CSV
survey_sample <- read.csv(file = "Data/SampleData.csv")

# Look at the data
head(survey_sample)
##         Date    Species GroupSize      TransectID Year
## 1 2017-11-02 Hartebeest         8 Balanite plains 2017
## 2 2018-01-11 Hartebeest         1 Balanite plains 2018
## 3 2018-01-11 Hartebeest         2 Balanite plains 2018
## 4 2018-01-11 Hartebeest         2 Balanite plains 2018
## 5 2018-02-01 Hartebeest         1 Balanite plains 2018
## 6 2018-08-02 Hartebeest         5 Balanite plains 2018
# Question 1: What are the dimensions of the dataset?  How many rows and columns?
dim(survey_sample)
## [1] 3679    5
nrow(survey_sample)
## [1] 3679
ncol(survey_sample)
## [1] 5
# Question 2: What is the data structure of each variable?
str(survey_sample)
## 'data.frame':    3679 obs. of  5 variables:
##  $ Date      : chr  "2017-11-02" "2018-01-11" "2018-01-11" "2018-01-11" ...
##  $ Species   : chr  "Hartebeest" "Hartebeest" "Hartebeest" "Hartebeest" ...
##  $ GroupSize : int  8 1 2 2 1 5 1 2 1 1 ...
##  $ TransectID: chr  "Balanite plains" "Balanite plains" "Balanite plains" "Balanite plains" ...
##  $ Year      : int  2017 2018 2018 2018 2018 2018 2018 2017 2017 2018 ...
# Question 3: How many species were counted?
n_distinct(survey_sample$Species)
## [1] 10
length(unique(survey_sample$Species))
## [1] 10
# Question 4: How many transects are there and what are the names of these transects?  Same process, but a different column.
n_distinct(survey_sample$TransectID)
## [1] 8
unique(survey_sample$TransectID)
## [1] "Balanite plains"        "Rekero bushlands"       "Enoolera plains"       
## [4] "Olare Sampu bushlands"  "Sampu Enkare bushlands" "Ilkisieusieu bushlands"
## [7] "Rekero plains"          "Payia plains"
# Question 5: How many unique survey days were completed?  What is the range of dates?
n_distinct(survey_sample$Date)
## [1] 37
length(unique(survey_sample$Date))
## [1] 37
min(survey_sample$Date)
## [1] "2017-10-17"
max(survey_sample$Date)
## [1] "2018-12-06"
range(survey_sample$Date)
## [1] "2017-10-17" "2018-12-06"
# Question 6: What is the maximum group size of giraffe?
survey_sample %>%
  filter(Species == "Giraffe") %>%
  summarise(max_grp = max(GroupSize))
##   max_grp
## 1      12
# Question 7: What is the mean group size of wildebeest per year?  Include the number of observations in the calculation.
survey_sample %>%
  filter(Species == "Wildebeest") %>%
  summarise(mean_grp = mean(GroupSize),
            n_obs = n(), # This counts the observations, it is not a sum of the observations.
            .by=Year)
##   Year  mean_grp n_obs
## 1 2017  8.305419   203
## 2 2018 10.794664   862
# Question 8: What is the total count of impala observed in 2018?
survey_sample %>%
  filter(Species == "Impala" & Year == "2018") %>%
  summarise(sum_impala = sum(GroupSize))
##   sum_impala
## 1       2182
# Question 9: What is the total number of groups of each species per year?  Note, not the total animals counted.
survey_sample %>%
  summarise(Total_Grps = n(),
            .by=c(Species, Year))
##              Species Year Total_Grps
## 1         Hartebeest 2017          3
## 2         Hartebeest 2018         21
## 3              Eland 2017          2
## 4              Eland 2018         32
## 5            Giraffe 2017         15
## 6            Giraffe 2018         62
## 7    Grant's gazelle 2017         56
## 8    Grant's gazelle 2018        119
## 9             Impala 2017        158
## 10            Impala 2018        299
## 11 Thomson's gazelle 2017        297
## 12 Thomson's gazelle 2018        736
## 13              Topi 2017         60
## 14              Topi 2018        171
## 15           Warthog 2017         39
## 16           Warthog 2018        167
## 17        Wildebeest 2017        203
## 18        Wildebeest 2018        862
## 19      Plains zebra 2017         64
## 20      Plains zebra 2018        313
# Question 10: What is the total number of groups and the mean group size of each species, per year, and per transect?  Sort these by Species, TransectID, and Year
survey_sample %>%
  summarise(Total_Grps = n(),
            Mean_Grps = mean(GroupSize),
            .by = c(Species, Year, TransectID)) %>%
  arrange(Species, TransectID, Year)
##               Species Year             TransectID Total_Grps  Mean_Grps
## 1               Eland 2018        Balanite plains          4   5.500000
## 2               Eland 2018        Enoolera plains          4   3.000000
## 3               Eland 2018 Ilkisieusieu bushlands          2  28.500000
## 4               Eland 2017  Olare Sampu bushlands          1   7.000000
## 5               Eland 2018  Olare Sampu bushlands          5   5.600000
## 6               Eland 2017       Rekero bushlands          1   1.000000
## 7               Eland 2018       Rekero bushlands          3   2.666667
## 8               Eland 2018          Rekero plains          6  19.666667
## 9               Eland 2018 Sampu Enkare bushlands          8   7.625000
## 10            Giraffe 2017        Balanite plains          2   1.500000
## 11            Giraffe 2018        Balanite plains          5   2.800000
## 12            Giraffe 2018        Enoolera plains          2   2.000000
## 13            Giraffe 2017 Ilkisieusieu bushlands          2   5.500000
## 14            Giraffe 2018 Ilkisieusieu bushlands         18   1.888889
## 15            Giraffe 2017  Olare Sampu bushlands          6   2.166667
## 16            Giraffe 2018  Olare Sampu bushlands          9   2.888889
## 17            Giraffe 2018           Payia plains          2   4.500000
## 18            Giraffe 2017       Rekero bushlands          3   2.000000
## 19            Giraffe 2018       Rekero bushlands         11   2.363636
## 20            Giraffe 2018          Rekero plains          3   5.666667
## 21            Giraffe 2017 Sampu Enkare bushlands          2   1.000000
## 22            Giraffe 2018 Sampu Enkare bushlands         12   3.000000
## 23    Grant's gazelle 2017        Balanite plains         15   4.133333
## 24    Grant's gazelle 2018        Balanite plains         26   3.923077
## 25    Grant's gazelle 2017        Enoolera plains          9   1.777778
## 26    Grant's gazelle 2018        Enoolera plains         16   2.937500
## 27    Grant's gazelle 2017 Ilkisieusieu bushlands          1   5.000000
## 28    Grant's gazelle 2018 Ilkisieusieu bushlands          9   3.000000
## 29    Grant's gazelle 2017  Olare Sampu bushlands         12   4.416667
## 30    Grant's gazelle 2018  Olare Sampu bushlands         10   3.500000
## 31    Grant's gazelle 2018           Payia plains          7   2.714286
## 32    Grant's gazelle 2017       Rekero bushlands         11   2.454545
## 33    Grant's gazelle 2018       Rekero bushlands         20   2.750000
## 34    Grant's gazelle 2017          Rekero plains          4   1.250000
## 35    Grant's gazelle 2018          Rekero plains         11   1.909091
## 36    Grant's gazelle 2017 Sampu Enkare bushlands          4   2.750000
## 37    Grant's gazelle 2018 Sampu Enkare bushlands         20   2.600000
## 38         Hartebeest 2017        Balanite plains          1   8.000000
## 39         Hartebeest 2018        Balanite plains          6   2.000000
## 40         Hartebeest 2018        Enoolera plains          1   1.000000
## 41         Hartebeest 2017       Rekero bushlands          2   1.500000
## 42         Hartebeest 2018       Rekero bushlands         14   3.214286
## 43             Impala 2017        Balanite plains         22   5.454545
## 44             Impala 2018        Balanite plains         29   4.689655
## 45             Impala 2017        Enoolera plains          9   2.111111
## 46             Impala 2018        Enoolera plains         11   5.090909
## 47             Impala 2017 Ilkisieusieu bushlands         11  17.545455
## 48             Impala 2018 Ilkisieusieu bushlands         40   7.625000
## 49             Impala 2017  Olare Sampu bushlands         30   3.166667
## 50             Impala 2018  Olare Sampu bushlands         76   7.263158
## 51             Impala 2017           Payia plains          1 100.000000
## 52             Impala 2018           Payia plains          7  22.285714
## 53             Impala 2017       Rekero bushlands         59   5.915254
## 54             Impala 2018       Rekero bushlands         67   8.716418
## 55             Impala 2017          Rekero plains          5   6.400000
## 56             Impala 2018          Rekero plains         10   3.800000
## 57             Impala 2017 Sampu Enkare bushlands         21   3.142857
## 58             Impala 2018 Sampu Enkare bushlands         59   6.016949
## 59       Plains zebra 2017        Balanite plains         19   2.894737
## 60       Plains zebra 2018        Balanite plains         27  21.407407
## 61       Plains zebra 2017        Enoolera plains          6   5.666667
## 62       Plains zebra 2018        Enoolera plains         22   4.500000
## 63       Plains zebra 2017 Ilkisieusieu bushlands         11   2.454545
## 64       Plains zebra 2018 Ilkisieusieu bushlands         46   9.500000
## 65       Plains zebra 2017  Olare Sampu bushlands          1  11.000000
## 66       Plains zebra 2018  Olare Sampu bushlands         42   5.380952
## 67       Plains zebra 2017           Payia plains         12   2.750000
## 68       Plains zebra 2018           Payia plains         60   5.016667
## 69       Plains zebra 2017       Rekero bushlands          5   2.000000
## 70       Plains zebra 2018       Rekero bushlands         58   5.568966
## 71       Plains zebra 2017          Rekero plains          5   3.200000
## 72       Plains zebra 2018          Rekero plains         12   8.916667
## 73       Plains zebra 2017 Sampu Enkare bushlands          5   4.800000
## 74       Plains zebra 2018 Sampu Enkare bushlands         46   5.152174
## 75  Thomson's gazelle 2017        Balanite plains         58   5.103448
## 76  Thomson's gazelle 2018        Balanite plains        108   6.796296
## 77  Thomson's gazelle 2017        Enoolera plains         46   7.369565
## 78  Thomson's gazelle 2018        Enoolera plains        111   7.540541
## 79  Thomson's gazelle 2017 Ilkisieusieu bushlands         12   4.250000
## 80  Thomson's gazelle 2018 Ilkisieusieu bushlands         52   6.538462
## 81  Thomson's gazelle 2017  Olare Sampu bushlands         13   7.615385
## 82  Thomson's gazelle 2018  Olare Sampu bushlands         44   4.386364
## 83  Thomson's gazelle 2017           Payia plains         69   4.855072
## 84  Thomson's gazelle 2018           Payia plains        169   6.923077
## 85  Thomson's gazelle 2017       Rekero bushlands         17   4.647059
## 86  Thomson's gazelle 2018       Rekero bushlands         30   4.933333
## 87  Thomson's gazelle 2017          Rekero plains         55   4.327273
## 88  Thomson's gazelle 2018          Rekero plains        106   7.669811
## 89  Thomson's gazelle 2017 Sampu Enkare bushlands         27   4.037037
## 90  Thomson's gazelle 2018 Sampu Enkare bushlands        116   6.362069
## 91               Topi 2017        Balanite plains          5   2.800000
## 92               Topi 2018        Balanite plains         33   4.606061
## 93               Topi 2017        Enoolera plains         11   2.363636
## 94               Topi 2018        Enoolera plains         26   2.230769
## 95               Topi 2017 Ilkisieusieu bushlands          1   4.000000
## 96               Topi 2018 Ilkisieusieu bushlands          7   3.285714
## 97               Topi 2017  Olare Sampu bushlands          3   1.333333
## 98               Topi 2018  Olare Sampu bushlands         17   2.882353
## 99               Topi 2018           Payia plains          1   1.000000
## 100              Topi 2017       Rekero bushlands         32   2.875000
## 101              Topi 2018       Rekero bushlands         46   1.891304
## 102              Topi 2017          Rekero plains          2   1.000000
## 103              Topi 2018          Rekero plains          8   2.750000
## 104              Topi 2017 Sampu Enkare bushlands          6   1.833333
## 105              Topi 2018 Sampu Enkare bushlands         33   1.848485
## 106           Warthog 2017        Balanite plains         10   1.900000
## 107           Warthog 2018        Balanite plains         31   2.419355
## 108           Warthog 2017        Enoolera plains          3   4.000000
## 109           Warthog 2018        Enoolera plains         16   2.500000
## 110           Warthog 2017 Ilkisieusieu bushlands          4   2.500000
## 111           Warthog 2018 Ilkisieusieu bushlands         13   2.307692
## 112           Warthog 2017  Olare Sampu bushlands          8   2.000000
## 113           Warthog 2018  Olare Sampu bushlands         44   2.704545
## 114           Warthog 2017           Payia plains          3   1.333333
## 115           Warthog 2018           Payia plains          8   2.500000
## 116           Warthog 2017       Rekero bushlands          3   1.000000
## 117           Warthog 2018       Rekero bushlands         11   1.909091
## 118           Warthog 2017          Rekero plains          1   1.000000
## 119           Warthog 2018          Rekero plains          1   1.000000
## 120           Warthog 2017 Sampu Enkare bushlands          7   2.000000
## 121           Warthog 2018 Sampu Enkare bushlands         43   3.395349
## 122        Wildebeest 2017        Balanite plains          9  17.000000
## 123        Wildebeest 2018        Balanite plains         54   8.851852
## 124        Wildebeest 2017        Enoolera plains         31   8.387097
## 125        Wildebeest 2018        Enoolera plains        113   9.761062
## 126        Wildebeest 2017 Ilkisieusieu bushlands          3   8.333333
## 127        Wildebeest 2018 Ilkisieusieu bushlands         42  14.690476
## 128        Wildebeest 2017  Olare Sampu bushlands         25  15.400000
## 129        Wildebeest 2018  Olare Sampu bushlands        121  11.619835
## 130        Wildebeest 2017           Payia plains         36   6.777778
## 131        Wildebeest 2018           Payia plains        241   8.780083
## 132        Wildebeest 2017       Rekero bushlands         28   5.821429
## 133        Wildebeest 2018       Rekero bushlands         73  28.452055
## 134        Wildebeest 2017          Rekero plains         18   4.055556
## 135        Wildebeest 2018          Rekero plains        126   6.007937
## 136        Wildebeest 2017 Sampu Enkare bushlands         53   7.226415
## 137        Wildebeest 2018 Sampu Enkare bushlands         92   8.163043